Skip to content

Feature: Handle read only columns#437

Open
driv3r wants to merge 6 commits intomainfrom
feat/handle-generated-columns
Open

Feature: Handle read only columns#437
driv3r wants to merge 6 commits intomainfrom
feat/handle-generated-columns

Conversation

@driv3r
Copy link
Copy Markdown
Contributor

@driv3r driv3r commented Apr 16, 2026

ref: #400 & @proton-lisandro-pin

The back and forth is taking a bit of time, lets speed this up, I've cherry-picked your commits so the attribution is there, CLA was signed on original PR, if you could sign the one here as well it would be 👍

Handle MySQL Generated Columns (STORED and VIRTUAL) in Data Replication

This PR adds support for MySQL generated columns (both VIRTUAL and STORED) to Ghostferry, enabling proper handling of computed columns during selective data replication.

Problem Statement

MySQL 8.0.23 introduced significant changes to how generated columns are handled in binary log ROW events:

  • Virtual columns are completely omitted from binlog events (not stored on disk)
  • Stored columns are included in binlog events (computed once and persisted)

Without special handling, Ghostferry would fail or produce incorrect results when replicating tables with generated columns, as it would attempt to insert values into columns that cannot be modified or would have incorrect column positions.

Solution

This PR implements a comprehensive solution with four key components:

  1. Row Expansion - Detects when MySQL omits virtual columns from binlog events and re-inserts nil sentinels to maintain consistent full-schema column indexing throughout the pipeline.

  2. Insert Value Filtering - Filters out generated column values before constructing INSERT statements, allowing only modifiable columns to be inserted while using proper column metadata for value escaping.

  3. Unsigned Integer Normalization - Fixed the order of operations: row expansion happens before unsigned integer normalization, ensuring consistent full-schema column indexing throughout.

  4. Verification with Generated Columns - Includes all columns (including generated) in fingerprint queries to detect divergence when computed values differ between source and target databases.

Changes

Core Implementation

  • dml_events.go: Modified INSERT event handling to filter out generated columns and improved binlog event processing for MySQL 8.0.23+ compatibility
  • table_schema_cache.go: Added column classification and filtering utilities:
    • IsColumnGenerated() - Identifies virtual and stored columns
    • NonGeneratedColumnNames() - Returns only insertable columns
    • FilterGeneratedColumnsOnRowData() - Removes generated column values from rows
  • row_batch.go: Updated row batch handling for filtered column data
  • iterative_verifier.go: Updated verification logic to handle generated columns

Test Coverage

  • Added unit tests for generated column handling with edge cases:
    • Virtual columns before unsigned integer columns
    • Generated columns before JSON columns (ensures JSON casting is preserved)
    • Mixed VIRTUAL and STORED columns
  • Added integration tests confirming:
    • Verification detects divergence in computed generated column values
    • Stored and virtual generated columns are handled correctly
    • Interrupt/resume scenarios work with generated columns

Edge Cases Handled

  • ✅ Virtual columns before unsigned integer columns
  • ✅ Generated columns before JSON columns (JSON casting preserved)
  • ✅ Mixed VIRTUAL and STORED columns in same table
  • ✅ MySQL version differences (pre-8.0.23 vs 8.0.23+)
  • ✅ Interrupt/resume with generated columns
  • ✅ Verification with computed column divergence detection

Testing

  • 5 new commits with comprehensive testing
  • Tests for critical edge cases involving generated columns and other column types
  • Integration tests verifying both inline and checkpoint verification modes
  • Fixed race condition in interrupt/resume tests

Related Issue

Closes #338

This PR modifies all `INSERT` logic so virtual (a.k.a generated) MySQL
columns are not attempted to insert into, which otherwise breaks
the ferrying process.

See also #338.
@driv3r driv3r self-assigned this Apr 16, 2026
@driv3r driv3r added enhancement New feature or request go Pull requests that update Go code labels Apr 16, 2026
@driv3r driv3r marked this pull request as ready for review April 16, 2026 19:28
@driv3r driv3r requested review from a team, austenLacy, forge33 and milanatshopify April 16, 2026 19:29
@driv3r
Copy link
Copy Markdown
Contributor Author

driv3r commented Apr 16, 2026

Hey @plisandro ! There was a bunch more things to tackle here, everything should be in place now, feel free to review and test, as well as sign the CLA mentioned in the checks for the contributions 👍

@ghost
Copy link
Copy Markdown

ghost commented Apr 17, 2026

Hey @plisandro ! There was a bunch more things to tackle here, everything should be in place now, feel free to review and test, as well as sign the CLA mentioned in the checks for the contributions 👍

This is much appreciated, thank you! Been testing it today and your PR seems to work well. CLA is now signed as well 😄

@driv3r
Copy link
Copy Markdown
Contributor Author

driv3r commented Apr 17, 2026

Hey @plisandro I'm not super familiar with the CLA stuff, but the error says:

  • @plisandro: Sign the CLA and comment "I have signed the CLA!" to re-run the checks and have your PR reviewed.

leave a comment and lets see, also I think you may have committed under your original account, so you might need to leave/sign under it as well 🤔

@plisandro
Copy link
Copy Markdown

I have signed the CLA!

@plisandro
Copy link
Copy Markdown

leave a comment and lets see, also I think you may have committed under your original account, so you might need to leave/sign under it as well 🤔

Done, and apologies for the confusion - i'm in the process of merging accounts now, and this was actually the last contribution left with the old one 🤦

@driv3r
Copy link
Copy Markdown
Contributor Author

driv3r commented Apr 17, 2026

@plisandro no problem!

Comment thread table_schema_cache.go

// Evaluates whether a TableSchema column is generated, by name.
func (t *TableSchema) IsColumnNameGenerated(name string) bool {
for _, col := range t.Columns {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we worried about iterating through columns every time? I know there can't be that many, so it's probably not an issue, but we do it in a few places in this file.

Copy link
Copy Markdown

@plisandro plisandro Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this locally, and has no discernible runtime impact when compared to a (previous) version which filters columns beforehand.

As noted in #400 (comment) , the problem with doing any pre-processing on columns is that they won't later align with DML and Binlog updates, so the logic to handle these becomes a mess.

Copy link
Copy Markdown
Contributor

@milanatshopify milanatshopify Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but is it worth it. See comment on row_batch.go

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that "a mess" includes being a tad slower 😄 To process row updates with generated columns missing on table definitions means that you effectively need to keep track of those columns, and their indeces, to filter them out on every update - which also involves re-aligning row data.

Filtering generated columns when SQL is constructed is actually simpler/cheaper. Either way, i tested this in a setup with tables with 50+ columns and couldn't measure any significant difference.

Comment thread row_batch.go
flattened[rowIdx*rowSize+colIdx] = col
flattened := make([]interface{}, 0, len(e.values)*len(e.columns))

for _, row := range e.values {
Copy link
Copy Markdown
Contributor

@milanatshopify milanatshopify Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now going through all rows, all columns, then all columns again. #rows * #cols^2. Or am I reading it wrong?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asking because this will add up when you have 86 columns in a large table, and we're doing string comparisons. no?

Copy link
Copy Markdown

@plisandro plisandro Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original change followed TableSchema as it exists today, where ignored and compressed columns are indexed by name.

Didn't seem to have any relevant performance hit on my test setup, but if this is a consideration, we could use a hashmap instead. Note it'd have to be lazily-initialized, as TableSchema is initialized ad-hoc all over ghostferry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request go Pull requests that update Go code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Trouble with virtual generated columns

4 participants